This data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Guiding Question: Which chemical properties influence the quality of red wines?
Obtain a summary of the data set:
#str(wines)
summary(wines)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
#Clean up and format
wines <- subset(wines, select = - X)
General overview of the variables in figure form.
Residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates appear to have extreme outliers. The figure can be replicated with these outliers hidden for clearer plotting purposes.
Density and pH have little variance and appear to be normally distributed. The sulfur dioxide values, fixed and volatile acidity, sulphates and residual sugar are positively skewed. Residual sugar and chlorides are long tailed and thus have extreme outliers. The distribution of the quality ratings is narrow: there are few extreme ratings, and most are in the middle of the distribution (ratings of 5 and 6).
Create another categorical variable classifying wines as ‘Bad’, ‘Average’ or ‘Good’:
wines$rating<-cut(as.numeric(wines$quality), c(2.5,4.5,6.5,8.5),
labels=c('bad','average','good'))
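To see how these breaks map scores to labels, here is a minimal sketch that applies the same cut() call to the six quality scores that occur in the data (3 to 8). The vector scores is a stand-in introduced for illustration and is not part of the original analysis.

```r
# Stand-in vector (an assumption for illustration): the quality scores
# that occur in the data set, 3 through 8.
scores <- 3:8

# Same breaks and labels as the call above. cut() uses right-closed
# intervals by default: (2.5, 4.5], (4.5, 6.5], (6.5, 8.5].
rating <- cut(as.numeric(scores), c(2.5, 4.5, 6.5, 8.5),
              labels = c('bad', 'average', 'good'))

# Show each score next to the rating it receives.
print(data.frame(quality = scores, rating = rating))
```

Scores of 3 and 4 become 'bad', 5 and 6 become 'average', and 7 and 8 become 'good'.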
Fixed acidity is slightly right-skewed, with a tail extending to a value of 15.9 \(\frac{g}{dm^{3}}\). The median 7.90 \(\frac{g}{dm^{3}}\) and mean 8.32 \(\frac{g}{dm^{3}}\) values were quite similar, suggesting that the outliers have little influence on the distribution. Volatile acidity had a similar distribution to fixed acidity; plotting on a log scale made both appear closer to normally distributed.
Superimposing the boxplot onto the scatter plots of these acidity values is a good way to summarise the data. From the scatterplot/boxplot volatile.acidity clearly has a more normal distribution as the boxplot is approximately at the centre of the points on the scatterplot.
#summary of fixed acidity
summary(wines$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
#summary of volatile acidity
summary(wines$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Citric acid had a median value of 0.260. Citric acid did not show a normal distribution when plotted on a log scale. The most common citric acid value for the wines appears to be around zero; to be more exact:
For citric acid, 132 wines had zero values, which equates to 8% of the wines. Since the bins surrounding zero contain far fewer than 132 wines, this could indicate a problem with the data collection.
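The count and percentage of zero values can be obtained along the following lines. This is only a sketch: zero_share is a hypothetical helper and toy is a made-up vector, since the call used in the original analysis is not shown.

```r
# Hypothetical helper (not from the original analysis): count of exact
# zeros in a numeric vector and their share of all values, in percent.
zero_share <- function(x) {
  n_zero <- sum(x == 0)
  c(count = n_zero, percent = 100 * n_zero / length(x))
}

# Made-up toy vector for illustration only; not the wine data.
toy <- c(0, 0, 0.04, 0.26, 0.49)
print(zero_share(toy))

# The figures quoted in the text: 132 zeros out of 1,599 wines.
print(100 * 132 / 1599)
```

On the real data the helper would be called as zero_share(wines$citric.acid).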
Residual sugar shows a concentration around the median of 2.2 \(\frac{g}{dm^{3}}\) with outliers up to a maximum of 15.5 \(\frac{g}{dm^{3}}\). From research we would expect the great majority of wines to fall between 1 and 4 \(\frac{g}{dm^{3}}\), as red wines are not sweet. The boxplot is superimposed onto the scatter plot to show the distribution of residual sugar values; the great majority of the values are in the 1 to 4 range.
#summary of residual sugar
summary(wines$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Chlorides had an approximately normal distribution around a peak at 0.07 \(\frac{g}{dm^{3}}\), with a long tail extending to 0.61 \(\frac{g}{dm^{3}}\).
#summary of chlorides
summary(wines$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Free and total sulfur dioxide are bunched at lower values. Total sulfur dioxide has a small number of extreme outliers.
#summary of free.sulfur.dioxide
summary(wines$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
#summary of total.sulfur.dioxide
summary(wines$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
From research, free sulfur dioxide varies from about 40% to 75% of the total sulfur dioxide, before sulphates are added. Let's add this ratio to the data set:
wines$free.sulfur.dioxide.ratio <- with( wines, free.sulfur.dioxide / total.sulfur.dioxide)
The plot of this ratio reveals a high concentration at a ratio of 0.5; maybe winemakers aim to achieve this ratio?
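One way to compare the new ratio with the 40% to 75% range quoted from research is sketched below. The data frame toy and its values are made up for illustration, and the in_range check is an addition that is not part of the original analysis.

```r
# Made-up toy values for illustration only; these are not rows of the wine data.
toy <- data.frame(
  free.sulfur.dioxide  = c(10, 18, 15),
  total.sulfur.dioxide = c(20, 40, 60)
)

# Same ratio as computed for the wine data above.
toy$free.sulfur.dioxide.ratio <- with(toy, free.sulfur.dioxide / total.sulfur.dioxide)

# Flag rows whose ratio lies in the 40% to 75% range quoted from research.
in_range <- with(toy, free.sulfur.dioxide.ratio >= 0.4 &
                      free.sulfur.dioxide.ratio <= 0.75)

print(toy)
print(mean(in_range))  # share of rows inside the range
```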
Density and pH have normal distributions with little variance.
#summary of density
summary(wines$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
#summary of pH
summary(wines$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Sulphates has a similar distribution to residual.sugar and chlorides.
#summary of sulphates
summary(wines$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Alcohol has a peak around 9.5 and seems to be correlated with sulfur.dioxide.
#summary of alcohol
summary(wines$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
There are 1,599 observations of wines with 11 chemical variables. All of these variables were measured as floating-point values; the remaining columns are the unique identifier X and the categorical variable quality.
The main feature of interest is quality: how is the value of quality determined?
All wines contain sulfur dioxide in various forms, sulphates are added by the winemaker as an additive. For experienced tasters high concentrations of added sulphates can be unpleasant. Residual sugar will be lower in red wine but higher outliers may cause the wine to be too sweet resulting in lower quality ratings. Acidity (fixed, volatile and citric) also plays a big part in wine quality, different combinations of these amounts will lead to a variation in the perceived quality.
I created two new variables: the ratio of free to total sulfur dioxide, and a categorical rating of quality.
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
The data was already in a tidy format. I did remove the variable X from the data set, as this variable simply represented the row index. Citric acid stood out: 8% of the wines had no citric acid. Citric acid in wine most commonly comes from acid supplements added to boost the wine's total acidity, which suggests that a significant proportion of the winemakers added no citric acid.
We can use a function to determine the correlation between each variable and quality, passing each of the variables to the function in turn.
cor_test <- function(x, y) {
return(cor.test(x, as.numeric(y))$estimate)
}
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## 0.01373164 -0.12890656 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol
## 0.25139708 0.47616632
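The call that produced the output above is not shown. A minimal sketch of passing each variable to cor_test in turn might look like this, using a made-up data frame toy rather than the wine data; the names predictors and correlations are assumptions for illustration.

```r
# Same helper as defined above, repeated so this block runs on its own.
cor_test <- function(x, y) {
  return(cor.test(x, as.numeric(y))$estimate)
}

# Made-up toy data for illustration only; not the wine data.
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.88, 0.70, 0.60, 0.45, 0.40, 0.35),
  quality          = c(3, 4, 5, 6, 7, 8)
)

# Correlate every column except quality with quality.
# unname() drops the "cor" label so the result is named by variable only.
predictors <- setdiff(names(toy), "quality")
correlations <- sapply(predictors,
                       function(v) unname(cor_test(toy[[v]], toy$quality)))
print(correlations)
```

On the real data, toy would be replaced by wines, with the derived rating and ratio columns excluded from predictors.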
None of the variables in the data set appears to be highly correlated with quality. The strongest correlations with quality were alcohol (0.476), volatile.acidity (-0.391), sulphates (0.251) and citric acid (0.226). The weakest correlations were residual sugar (0.014), free.sulfur.dioxide (-0.051) and pH (-0.058).
The median alcohol value vs quality shows an increase after a quality value of 5. I suspect this may be because the 'fuller' flavour that comes with increasing alcohol content is perceived as higher quality. The trend line shows the relationship between quality and alcohol content.
## factor(quality): 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## factor(quality): 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## factor(quality): 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## factor(quality): 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## factor(quality): 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## factor(quality): 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
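Per-quality summaries in this layout can be produced with by(). The exact call used for the output above is not shown, so the following is a sketch on a made-up data frame toy, with tapply() added as a compact way to pull out just the medians.

```r
# Made-up toy data for illustration only; not the wine data.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.7, 10.0, 10.2, 10.8, 11.5, 12.1)
)

# One summary() per quality level, printed as separate blocks.
alcohol_by_quality <- with(toy, by(alcohol, factor(quality), summary))
print(alcohol_by_quality)

# Just the medians, as a named vector indexed by quality level.
medians <- with(toy, tapply(alcohol, quality, median))
print(medians)
```

The same pattern applies to the other variables by swapping alcohol for, say, volatile.acidity or sulphates.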
Quality and volatile acidity display a negative correlation, almost as strong as that between alcohol and quality but in the other direction. Volatile.acidity appears to be an unwanted feature in wine: quality increases as it goes down.
The lowest quality wines have a high median value of volatile acidity. From research, wine spoilage is legally defined by volatile acidity, which is largely composed of acetic acid. Higher proportions of acetic acid also lead to unpleasant aromas in wine. Many winemakers seek a low value of acetic acid, as this adds to the perceived complexity of a wine. Higher levels of volatile acidity (acetic acid) may therefore negatively affect the taste and aroma of a wine.
## factor(quality): 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## factor(quality): 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## factor(quality): 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## factor(quality): 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## factor(quality): 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## factor(quality): 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
The correlation between sulphates and quality was 0.25. This is not as high as for alcohol and volatile acidity, but there is a general trend for higher quality wines to have higher sulphate values. Sulphates are added by winemakers to prevent oxidation and keep unwanted bacteria at bay. This is therefore done not to improve the taste but to ensure the taste does not degrade; maybe the lower quality wines have started to degrade, affecting their perceived quality. Wines with a quality of 5 or 6 had a large number of outliers, which would drive down the correlation value. The median sulphate value increases from the lowest quality factor up to a quality factor of 7, and is the same (0.74) for quality factors 7 and 8.
## factor(quality): 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## factor(quality): 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## factor(quality): 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## factor(quality): 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## factor(quality): 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## factor(quality): 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
The correlation between citric acid and quality was 0.23. The median value goes from 0.09 at a quality of 4 to 0.42 at a quality of 8. The lower quality wines have large citric acid outliers, while the higher quality wines tend to have more similar median and mean values; the box plot makes this much clearer.
The overall trend is higher citric acid values at higher quality wines.
## factor(quality): 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## factor(quality): 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## factor(quality): 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## factor(quality): 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## factor(quality): 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## factor(quality): 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
##
## Pearson's product-moment correlation
##
## data: wines$volatile.acidity and wines$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
There is a moderately strong negative correlation (-0.55) between volatile acidity and citric acid. Maybe there is a more complex relationship between the acids in wine?
The correlations between quality, alcohol and volatile.acidity can be seen more clearly on one plot. Higher quality wines have higher alcohol and lower volatile.acidity.
Let's now add sulphates and replot. Higher quality wines have higher alcohol (x-axis), lower volatile acidity (y-axis) and more sulphates (hue).
Question: Can we use our variables to predict quality? Using all the variables, linear regression can be attempted to predict the quality of wine. From the summary data, the coefficient of determination indicates the fitted line will miss many points. This is verified from the residual data plot vs the fitted line.
##
## Calls:
## wines_model: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = wines)
##
## ====================================
## (Intercept) 21.965
## (21.195)
## fixed.acidity 0.025
## (0.026)
## volatile.acidity -1.084***
## (0.121)
## citric.acid -0.183
## (0.147)
## residual.sugar 0.016
## (0.015)
## chlorides -1.874***
## (0.419)
## free.sulfur.dioxide 0.004*
## (0.002)
## total.sulfur.dioxide -0.003***
## (0.001)
## density -17.881
## (21.633)
## pH -0.414*
## (0.192)
## sulphates 0.916***
## (0.114)
## alcohol 0.276***
## (0.026)
## ------------------------------------
## R-squared 0.4
## adj. R-squared 0.4
## sigma 0.6
## F 81.3
## p 0.0
## Log-likelihood -1569.1
## Deviance 666.4
## AIC 3164.3
## BIC 3234.2
## N 1599
## ====================================
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68911 -0.36652 -0.04699 0.45202 2.02498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.197e+01 2.119e+01 1.036 0.3002
## fixed.acidity 2.499e-02 2.595e-02 0.963 0.3357
## volatile.acidity -1.084e+00 1.211e-01 -8.948 < 2e-16 ***
## citric.acid -1.826e-01 1.472e-01 -1.240 0.2150
## residual.sugar 1.633e-02 1.500e-02 1.089 0.2765
## chlorides -1.874e+00 4.193e-01 -4.470 8.37e-06 ***
## free.sulfur.dioxide 4.361e-03 2.171e-03 2.009 0.0447 *
## total.sulfur.dioxide -3.265e-03 7.287e-04 -4.480 8.00e-06 ***
## density -1.788e+01 2.163e+01 -0.827 0.4086
## pH -4.137e-01 1.916e-01 -2.159 0.0310 *
## sulphates 9.163e-01 1.143e-01 8.014 2.13e-15 ***
## alcohol 2.762e-01 2.648e-02 10.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
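The fitting code is not shown above. A minimal sketch of the same workflow, on a made-up data frame toy with only two predictors (the real model uses all eleven), could look like this. The names toy_model, r_squared and diagnostics are assumptions for illustration.

```r
# Made-up toy data for illustration only; not the wine data.
toy <- data.frame(
  quality          = c(5, 5, 6, 5, 6, 7, 6, 7, 5, 6),
  alcohol          = c(9.4, 9.8, 9.8, 9.5, 10.5, 11.2, 10.0, 12.0, 9.2, 11.0),
  volatile.acidity = c(0.70, 0.88, 0.58, 0.66, 0.40, 0.35, 0.50, 0.30, 0.62, 0.45)
)

# Fit a linear model of quality on the chosen predictors.
toy_model <- lm(quality ~ alcohol + volatile.acidity, data = toy)

# Coefficient of determination, plus residuals paired with fitted values
# (the quantities shown in a residuals-vs-fitted plot).
r_squared <- summary(toy_model)$r.squared
diagnostics <- data.frame(fitted = fitted(toy_model),
                          residual = resid(toy_model))
print(r_squared)
print(diagnostics)
```

Calling plot(diagnostics$fitted, diagnostics$residual) would draw the residual plot; for the wine data the formula would list all eleven chemical variables, as in the Call shown above.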
The p-values for fixed acidity, citric acid, residual sugar and density are greater than 5%, so at the 95% confidence level there is no evidence that they make a significant contribution to the model. This contradicts the correlation results for citric acid. One possible explanation is that citric acid is itself correlated with volatile acidity (-0.55), so part of its effect may be absorbed by that term; the relationship between citric acid, quality and the other acidity variables may be more complex than first thought. The R-squared value is only 0.36, but the p-value of the overall F-test is less than 5%, indicating that the variables taken together do have an impact on the quality rating. The residual plot also shows larger deviations from the fitted line at the lowest and highest quality ratings.
A linear model can therefore be summarized as:
quality = 21.965 + 0.025fixed.acidity - 1.084volatile.acidity - 0.183citric.acid + 0.016residual.sugar-1.874chlorides + 0.004free.sulfur.dioxide - 0.003total.sulfur.dioxide - 17.881density- 0.414pH + 0.916sulphates + 0.276alcohol
Using real values of the variables we can test our predictive model:
#using these variables expect a value around 3
predict.lm(wines_model, data.frame( alcohol=9.0, chlorides=0.074, free.sulfur.dioxide=10.0, pH=3.25, sulphates=0.57, total.sulfur.dioxide=47,volatile.acidity=0.58, citric.acid=0.66,residual.sugar=2.2,density=1,
fixed.acidity=11.6), type="r")
## 1
## 5.076092
#using these variables expect a value of 5
predict.lm(wines_model, data.frame( alcohol=10.3, chlorides=0.069, free.sulfur.dioxide=9.0, pH=3.3, sulphates=1.2, total.sulfur.dioxide=23,volatile.acidity=0.66, citric.acid=0.22,residual.sugar=2.2,density=0.99,
fixed.acidity=7.7), type="r")
## 1
## 6.150106
#using these variables expect a value of 7
predict.lm(wines_model, data.frame( alcohol=11.5, chlorides=0.083, free.sulfur.dioxide=21.0, pH=3.7, sulphates=0.71, total.sulfur.dioxide=59,volatile.acidity=0.66, citric.acid=0.39,residual.sugar=3.2,density=0.99,
fixed.acidity=9.8), type="r")
## 1
## 5.813426
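As a check on the model equation, the first of the three predictions can be reproduced by hand from the rounded coefficients. This is a sketch; the vectors coefs and inputs are introduced here for illustration, with values copied from the equation and from the first predict.lm call.

```r
# Coefficients as rounded in the model equation above.
coefs <- c(intercept = 21.965, fixed.acidity = 0.025, volatile.acidity = -1.084,
           citric.acid = -0.183, residual.sugar = 0.016, chlorides = -1.874,
           free.sulfur.dioxide = 0.004, total.sulfur.dioxide = -0.003,
           density = -17.881, pH = -0.414, sulphates = 0.916, alcohol = 0.276)

# Inputs of the first predict.lm() call; the leading 1 multiplies the intercept.
inputs <- c(intercept = 1, fixed.acidity = 11.6, volatile.acidity = 0.58,
            citric.acid = 0.66, residual.sugar = 2.2, chlorides = 0.074,
            free.sulfur.dioxide = 10.0, total.sulfur.dioxide = 47,
            density = 1, pH = 3.25, sulphates = 0.57, alcohol = 9.0)

# Match by name so the order of the two vectors does not matter.
hand_prediction <- sum(coefs * inputs[names(coefs)])
print(hand_prediction)
```

With the rounded coefficients this gives roughly 5.08, close to the 5.076 returned by predict.lm; the small gap comes from rounding the coefficients to three decimals.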
The model predictions are skewed towards our middle quality rating (5 and 6). This agrees with the frequency count for the quality variable - a heavy bias towards the middle quality rating values.
Wine quality is displayed in histogram form as Figure 1. Values are heavily concentrated around the median. Few wines have a quality of 3 or 8. Of the 1,599 observations in the data set, 1,319 were rated at a quality of 5 or 6, which accounts for 82% of the ratings. Only 10 and 18 wines were rated at 3 and 8 respectively, just 1.8% of the total. The middle scores have a much lighter blue colour on the colour heat map, while the low and high scores are a much darker shade of blue.
Figure 2 shows alcohol, sulphates, volatile acidity and citric acid against wine rating as boxplots. Wine rating summarises the quality scores from 3 to 8 as bad, average and good. Good wines have the highest alcohol by volume, with a median around 11.5. Bad and average ratings show a similar alcohol content, though average rated wines had a significant number of outliers with greater alcohol content, suggesting the rating of these wines was boosted by their higher alcohol content. Sulphates show an upward trend, with sulphate content increasing with wine rating. As in the alcohol boxplot, the average rating shows the highest number of outliers at higher sulphate content.
Citric acid vs wine rating shows a general upward trend in citric acid content. Volatile acidity demonstrates a negative relationship with wine rating.
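The two percentages follow directly from the counts quoted in this paragraph. A short sketch of the arithmetic (the variable names are assumptions for illustration):

```r
# Counts quoted in the text.
n_total   <- 1599     # all wines
n_middle  <- 1319     # wines rated 5 or 6
n_extreme <- 10 + 18  # wines rated 3 or 8

# Shares of the total, in percent.
pct_middle  <- round(100 * n_middle / n_total)
pct_extreme <- round(100 * n_extreme / n_total, 1)

print(pct_middle)   # share of wines rated 5 or 6
print(pct_extreme)  # share of wines rated 3 or 8
```

On the data itself, table(wines$quality) would give the underlying counts per quality score.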
Figure 3 shows the relationship between alcohol and volatile acidity with wine rating. The ‘good’ points tend to be clustered at points in the figure with higher alcohol and lower volatile acidity values. The data set in wine quality terms is unbalanced with many average wines but few with poor or good ratings.
This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). My goal was to determine which chemical properties influence the quality of red wines. Each variable was investigated and then the relationships between these variables uncovered by correlation. A linear model was built using the variables and comparison was made between the linear model and actual values.
By correlation, the chemical properties most strongly associated with quality appeared to be alcohol, sulphates, citric acid and volatile acidity. The linear model also picked out alcohol, sulphates and volatile acidity as the most likely determining factors of quality, though it did not demonstrate a strong relationship between quality and citric acid. The narrow spread of the quality ratings, with relatively few wines at lower and higher ratings, meant the linear model was biased towards the average rated wines.
The analysis was carried out on Portuguese wines; it would be interesting to see if similar results were found using wines from another country, such as Spain, which has a similar geography.